Nature Genetics — Latest Matching Preprints

1

Pod indehiscence in common bean is associated to the fine regulation of PvMYB26 and a non-functional abscission layer

Di Vittori, V.; Bitocchi, E.; Rodriguez, M.; Alseekh, S.; Bellucci, E.; Nanni, L.; Gioia, T.; Marzario, S.; Logozzo, G.; Rossato, M.; De Quattro, C.; Murgia, M. L.; Ferreira, J. J.; Campa, A.; Xu, C.; Fiorani, F.; Sampathkumar, A.; Fröhlich, A.; Attene, G.; Delledonne, M.; Usadel, B.; Fernie, A. R.; Rau, D.; Papa, R.

2020-04-04 genetics 10.1101/2020.04.02.021972 medRxiv

Top 0.1%

50.2%

In legumes, pod shattering occurs when mature pods dehisce along the sutures, and detachment of the valves promotes seed dispersal. In Phaseolus vulgaris, the major locus qPD5.1-Pv for pod indehiscence was identified recently. We developed a BC4/F4 introgression line population and narrowed the major locus down to a 22.5-kb region. Here, gene expression and a parallel histological analysis of dehiscent and indehiscent pods identified an AtMYB26 orthologue as the best candidate for loss of pod shattering, on a genomic region ~11 kb downstream of the highest associated peak. Based on mapping and expression data, we propose early and fine up-regulation of PvMYB26 in dehiscent pods. Detailed histological analysis establishes that pod indehiscence is associated with the lack of a functional abscission layer in the ventral sheath, and that the key anatomical modifications associated with pod shattering in common bean occur early during pod development. We finally propose that loss of pod shattering in legumes resulted from histological convergent evolution and that this is the result of selection at orthologous loci. One-sentence summary: A non-functional abscission layer determines the loss of pod shattering; mapping data, together with parallel gene expression and histological analysis, support PvMYB26 as the candidate gene for pod indehiscence.

2

Imputation of structural variants using a multi-ancestry long-read sequencing panel enables identification of disease associations

Noyvert, B.; Erzurumluoglu, A. M.; Drichel, D.; Omland, S.; Andlauer, T. F. M.; Mueller, S.; Sennels, L.; Becker, C.; Kantorovich, A.; Bartholdy, B. A.; Braenne, I.; Bolivar-Lopez, J. C.; Mistrellides, C.; Belbin, G. M.; Li, J. H.; Pickrell, J. K.; de Jong, J.; Arora, J.; Hu, Y.; Boehringer Ingelheim - Global Computational Biology and Digital Sciences; Wood, C. R.; Kriegl, J. M.; Podduturi, N.; Jensen, J. N.; Stutzki, J.; Ding, Z.

2023-12-22 genetic and genomic medicine 10.1101/2023.12.20.23300308 medRxiv

Top 0.1%

50.0%

Advancements in long-read sequencing technology have accelerated the study of large structural variants (SVs). We created a curated, publicly available, multi-ancestry SV imputation panel by long-read sequencing 888 samples from the 1000 Genomes Project. This high-quality panel was used to impute SVs in approximately 500,000 UK Biobank participants. We demonstrated the feasibility of conducting genome-wide SV association studies at biobank scale using 32 disease-relevant phenotypes related to respiratory, cardiometabolic and liver diseases, in addition to 1,463 protein levels. This analysis identified thousands of genome-wide significant SV associations, including hundreds of conditionally independent signals, thereby enabling novel biological insights. Focusing on genetic association studies of lung function as an example, we demonstrate the added value of SVs for prioritising causal genes at gene-rich loci compared to traditional GWAS using only short variants. We envision that future post-GWAS gene-prioritisation workflows will incorporate SV analyses using this SV imputation panel and framework.

3

Genome-wide association meta-regression identifies stem cell lineage orchestration as a key driver of acne risk

Maxwell, J.; Mitchell, B. L.; DuHarpur, X.; Pardo, L. M.; Witkam, W. C. A. M.; Dand, N.; Bartels, M.; Betti, M. J.; Boomsma, D. I.; Dong, X.; Gerring, Z.; Finer, S.; Genes & Health Research Team; Hagenbeek, F. A.; Hottenga, J. J.; Hripcsak, G.; Huilaja, L.; Hveem, K.; Jacobs, B. M.; Kals, M.; Kaufman-Cook, J.; Kettunen, J.; Khan, A.; Kingo, K.; Kiryluk, K.; Loset, M.; Lunter, G.; Lupton, M. K.; Min, J. L.; Martin, N. G.; Medland, S. E.; Neijzen, D.; Nijsten, T. E. C.; Nikopensius, T.; Olsen, C. M.; Petukhova, L.; Reigo, A.; Renteria, M. E.; Rispoli, R.; Saklatvala, J.; Sliz, E.; Tasanen-Maa

2025-06-28 dermatology 10.1101/2025.06.27.25330406 medRxiv

Top 0.1%

43.9%

Over 85% of the population experience acne at some point in their lives, with its severity spanning a quantitative spectrum, from mild, transient outbreaks to more persistent, severe forms of the condition. Moderate to severe disease poses a substantial global burden arising from both the physical and psychological impacts of this highly visible condition. The analytical approach taken in this study aimed to address the impact of variation in the dichotomisation of acne case-control status, driven by ascertainment and study design, on effect size estimates across independent genetic association studies of acne. Through a fixed intercept meta-regression framework, we combined evidence genome-wide for association with acne across studies in which case-control status had been ascertained in different settings, allowing for different severity threshold definitions. Across a combined sample of 73,997 cases and 1,103,940 controls of European, South Asian and African American ancestry we identify genetic variation at 165 genomic loci that influence acne risk. There is evidence for both shared and ancestry-specific components to the genetic susceptibility to acne and for sex differences in the magnitude of effect of risk alleles at three loci. We observe that common genetic variation explains 13.4% of acne heritability on the liability scale. Consistent with the hypothesis that genetic risk primarily operates at the level of individual pilosebaceous units, a polygenic score derived from this case-control study of acne susceptibility is associated with both self-reported and clinically assessed acne severity in adolescence, further strengthening the link between genetic risk and disease severity. Prioritisation of causal genes at the identified acne risk loci provides genetic validation of the targets of established and emerging acne therapies, including retinoid treatments. The identified acne risk loci are enriched for genes encoding downstream effectors of RXRA signalling, including SOX9 and components of the WNT and p53 pathways, illustrating that the control of stem cell lineage plasticity and cellular fate are important mechanisms through which genetic variation influences acne susceptibility within the pilosebaceous unit.

4

Cross-cohort analysis of expression and splicing quantitative trait loci in TOPMed

Orchard, P.; Blackwell, T. W.; Kachuri, L.; Castaldi, P. J.; Cho, M. H.; Christenson, S. A.; Durda, P.; Gabriel, S.; Hersh, C. P.; Huntsman, S.; Hwang, S.; Joehanes, R.; Johnson, M.; Li, X.; Lin, H.; Liu, C.-T.; Liu, Y.; Mak, A. C. Y.; Manichaikul, A. W.; Paik, D.; Saferali, A.; Smith, J. D.; Taylor, K. D.; Tracy, R. P.; Wang, J.; Wang, M.; Weinstock, J. S.; Weiss, J.; Wheeler, H. E.; Zhou, Y.; Zoellner, S.; Wu, J. C.; Mestroni, L.; Graw, S.; Taylor, M. R. G.; Ortega, V. E.; Johnson, C. W.; Gan, W.; Abecasis, G.; Nickerson, D. A.; Gupta, N.; Ardlie, K.; Woodruff, P. G.; Zheng, Y.; Bowler, R. P

2025-02-21 genetic and genomic medicine 10.1101/2025.02.19.25322561 medRxiv

Top 0.1%

41.1%

Most genetic variants associated with complex traits and diseases occur in non-coding genomic regions and are hypothesized to regulate gene expression. To understand the genetics underlying gene expression variability, we characterize 14,324 ancestrally diverse RNA-sequencing samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) program and integrate whole genome sequencing data to perform cis and trans expression and splicing quantitative trait locus (cis-/trans-e/sQTL) analyses in six tissues and cell types, most notably whole blood (N=6,454) and lung (N=1,291). We show this dataset enables greater detection of secondary cis-e/sQTL signals than was achieved in previous studies, and that secondary cis-eQTL and primary trans-eQTL signal discovery is not saturated even though eGene discovery is. Most TOPMed trans-eQTL signals colocalize with cis-e/sQTL signals, suggesting many trans signals are mediated by cis signals. We fine-map European UK Biobank GWAS signals from 164 traits and colocalize the resulting 34,107 fine-mapped GWAS signals with TOPMed e/sQTL signals, finding that of 10,611 GWAS signals with a colocalization, 7,096 GWAS signals colocalize with at least one secondary e/sQTL signal. These results demonstrate that larger e/sQTL analyses will continue to uncover secondary e/sQTL signals, and that these new signals will benefit GWAS interpretation.

5

Incorporating phenotype heterogeneity in disease GWAS improves power while maintaining specificity

Hof, J. J. P.; Ning, C.; Quinn, L.; Speed, D.

2026-03-27 genetic and genomic medicine 10.64898/2026.03.26.26349370 medRxiv

Top 0.1%

41.0%

Common complex diseases are clinically heterogeneous, yet most genome-wide association studies (GWAS) assume cases are genetically homogeneous. This challenge is compounded in large-scale biobanks, which increasingly combine cases ascertained under different recruitment strategies, raising concerns that heterogeneous case definitions may dilute genetic signal. To address this, we developed StratGWAS, a scalable framework that leverages clinical features of heterogeneity to construct a transformed phenotype that better reflects genetic liability within diseases. StratGWAS stratifies cases using secondary phenotypic information such as age of onset, medication burden, or recruitment definition. StratGWAS then estimates genetic covariance between strata, and derives a transformed phenotype that upweights cases with higher inferred genetic liability. Through simulation studies (N = 100k) and application to the UK Biobank (N = 368k), we show that StratGWAS consistently outperformed standard GWAS methods. Applied to 21 UK Biobank traits, StratGWAS upweighted individuals with earlier disease onset and higher medication burden, yielding respectively 17% and 4% more independent genome-wide significant loci than standard case-control GWAS. Applied to depression, StratGWAS upweighted individuals with multiple diagnoses, greater psychiatric comorbidity, or higher self-reported depressive symptoms, identifying eight additional independent loci compared to case-control GWAS.

6

The Trans-Ancestral Genomic Architecture of Glycaemic Traits

Chen, J.

2020-07-25 genetics 10.1101/2020.07.23.217646 medRxiv

Top 0.1%

40.9%

Glycaemic traits are used to diagnose and monitor type 2 diabetes and cardiometabolic health. To date, most genetic studies of glycaemic traits have focused on individuals of European ancestry. Here, we aggregated genome-wide association studies in up to 281,416 individuals without diabetes (30% non-European ancestry) with fasting glucose, 2h-glucose post-challenge, glycated haemoglobin, and fasting insulin data. Trans-ancestry and single-ancestry meta-analyses identified 242 loci (99 novel; P < 5×10⁻⁸), 80% with no significant evidence of between-ancestry heterogeneity. Analyses restricted to European ancestry individuals with equivalent sample size would have led to 24 fewer new loci. Compared to single-ancestry, equivalent sized trans-ancestry fine-mapping reduced the number of estimated variants in 99% credible sets by a median of 37.5%. Genomic feature, gene-expression and gene-set analyses revealed distinct biological signatures for each trait, highlighting different underlying biological pathways. Our results increase understanding of diabetes pathophysiology by use of trans-ancestry studies for improved power and resolution.

7

Cross-omic dissection reveals locus-specific heterogeneity and antagonistic pleiotropy between Alzheimer's disease and type 2 diabetes

Adewuyi, E. O.; Auta, A.; Okoh, O. S.; Selmer, K.; Gervin, K.; Nyholt, D. R.; Pereira, G.

2026-03-25 genetic and genomic medicine 10.64898/2026.03.23.26349030 medRxiv

Top 0.1%

40.9%

Observational studies associate type 2 diabetes (T2D) with increased dementia risk; however, the specificity of this relationship to Alzheimer's disease (AD) and its biological underpinnings remain unresolved. We apply an integrative cross-omic framework to dissect genetic links between AD and T2D. Genome-wide analyses reveal a modest positive genetic correlation and robust polygenic sign concordance of AD with T2D. High-resolution analyses demonstrate locus-specific heterogeneity, with coexisting positive and predominantly negative correlations, and strong inverse associations at APOE and HLA. Cross-trait GWAS meta-analyses indicate that most genome-wide significant signals reflect trait-specific effects, with only a limited set of variants supported in both AD and T2D. Colocalisation reveals distinct causal variants at most shared loci. Gene-based analyses highlight convergence at functional genes, including PLEKHA1, VKORC1, ACE, and APOE, without implying concordant variant-level effects. Bidirectional Mendelian randomisation (MR) shows no evidence of a causal relationship between AD and T2D in either direction. Summary-data MR prioritises genes whose expression or methylation affects both AD and T2D, mostly with opposing effects. Only PLEKHA1 (eQTL) and CAMTA2 (mQTL) show concordant positive associations. Five genes, GALNT10, HSD3B7, BCKDK, KAT8, and ACE, are supported across both regulatory layers, while numerous signals cluster within a regulatory hotspot at 16p11.2, supporting convergent transcriptional and epigenetic involvement, despite directional divergence. These results refine the AD-T2D relationship; rather than a simple shared-risk model, overlap reflects locus-specific heterogeneity and cross-omic convergence often showing opposing effects on AD versus T2D risk, consistent with antagonistic pleiotropy.

8

A Treatment-Naive Cellular Atlas of Pediatric Crohn's Disease Predicts Disease Severity and Therapeutic Response

Zheng, H. B.; Doran, B. A.; Kimler, K.; Yu, A.; Tkachev, V.; Niederlova, V.; Cribbin, K.; Fleming, R.; Bratrude, B.; Betz, K.; Cagnin, L.; McGuckin, C.; Keskula, P.; Albanese, A.; Sacta, M.; de Sousa Casal, J.; Taliaferro, F.; Ford, M.; Ambartsumyan, L.; Suskind, D. L.; Lee, D.; Deutsch, G.; Deng, X.; Collen, L. V.; Mitsialis, V.; Snapper, S. B.; Wahbeh, G.; Shalek, A. K.; Ordovas-Montanes, J.; Kean, L. S.

2021-09-22 gastroenterology 10.1101/2021.09.17.21263540 medRxiv

Top 0.1%

40.3%

Crohn's disease is an inflammatory bowel disease (IBD) commonly treated through anti-TNF blockade. However, most patients still relapse and inevitably progress. Comprehensive single-cell RNA-sequencing (scRNA-seq) atlases have largely sampled patients with established treatment-refractory IBD, limiting our understanding of which cell types, subsets, and states at diagnosis anticipate disease severity and response to treatment. Here, through combining clinical, flow cytometry, histology, and scRNA-seq methods, we profile diagnostic human biopsies from the terminal ileum of treatment-naive pediatric patients with Crohn's disease (pediCD; n=14), matched repeat biopsies (pediCD-treated; n=8) and from non-inflamed pediatric controls with functional gastrointestinal disorders (FGID; n=13). To resolve and annotate epithelial, stromal, and immune cell states among the 201,883 baseline single-cell transcriptomes, we develop a principled and unbiased tiered clustering approach, ARBOL. Through flow cytometry and scRNA-seq, we observe that treatment-naive pediCD and FGID have similar broad cell type composition. However, through high-resolution scRNA-seq analysis and microscopy, we identify significant differences in cell subsets and states that arise during pediCD relative to FGID. By closely linking our scRNA-seq analysis with clinical meta-data, we resolve a vector of T cell, innate lymphocyte, myeloid, and epithelial cell states in treatment-naive pediCD (pediCD-TIME) samples which can distinguish patients along the trajectory of disease severity and anti-TNF response. By using ARBOL with integration, we position repeat on-treatment biopsies from our patients between treatment-naive pediCD and on-treatment adult CD. We identify that anti-TNF treatment pushes the pediatric cellular ecosystem towards an adult, more treatment-refractory state. Our study jointly leverages a treatment-naive cohort, high-resolution principled scRNA-seq data analysis, and clinical outcomes to understand which baseline cell states may predict Crohn's disease trajectory.

9

Phenome-wide association of multiallelic copy number variation in 422,170 UK Biobank individuals reveals novel genetic loci associated with disease

Eisenberg, M.; Packer, R.; Shrine, N.; Demidov, G.; Pack, H.; Hollox, E. J.; Fawcett, K.

2026-06-04 genetic and genomic medicine 10.64898/2026.06.03.26354825 medRxiv

Top 0.1%

39.9%

The contribution of multi-allelic CNVs (mCNVs) to disease risk has not been widely studied. This is largely because they have been difficult to characterise genome-wide at large scale, and are often not strongly associated with flanking SNVs, limiting imputation. Improved understanding of the role of mCNVs in disease risk could lead to novel insights into the pathobiology of disease. We robustly typed 69 mCNVs from UK Biobank whole exome sequences in discovery (n=150,682) and replication sets (n=269,317). Discovery and replication PheWAS used clinically-curated composite phenotypes by integrating self-report, primary and secondary health care data to interrogate these variants, for unrelated British individuals of African, European and Central/South Asian ancestries. 173 mCNV-phenotype associations were detected from 26 mCNVs, of which 114 associations replicated. One of eight potentially novel mCNV-phenotype signals was independent of neighbouring associated SNVs: the association of Sulfotransferase 1A1 and 1A2 genes (SULT1A1/SULT1A2) with estimated glomerular filtration rate (eGFR) in individuals of European ancestry (meta-analysed p=1.05×10⁻⁹, beta=0.016 [0.011; 0.021]). Other potentially novel associations include Golgi phosphoprotein 3 (GOLPH3) with the cardiovascular phenotype bundle branch block in individuals of South Asian ancestry (meta-analysed p=3.35×10⁻⁶, OR=2.13 [1.53; 2.96]) and alpha amylase 2B (AMY2B) with ventricular fibrillation and flutter in individuals of European ancestry (meta-analysed p=2.48×10⁻⁶, OR=1.50 [1.26; 1.78]). In summary, we show that accurate typing of biobank-scale sample sizes can identify associations between traits and mCNVs, acting through a gene dosage relationship. Our work provides several novel likely causative variants contributing to particular traits of clinical importance and immediately suggests a putative functional mechanism for the observed associations.

10

Mapping disease loci to biological processes via joint pleiotropic and epigenomic partitioning

Kerner, G.; Kamitaki, N.; Strober, B.; Price, A. L.

2025-05-06 genetic and genomic medicine 10.1101/2025.05.05.25327017 medRxiv

Top 0.1%

39.7%

Genome-wide association studies (GWAS) have identified thousands of disease-associated loci, yet their interpretation remains limited by the heterogeneity of underlying biological processes. We propose Joint Pleiotropic and Epigenomic Partitioning (J-PEP), a clustering framework that integrates pleiotropic SNP effects on auxiliary traits and tissue-specific epigenomic data to partition disease-associated loci into biologically distinct clusters. To benchmark J-PEP against existing methods, we introduce a metric--Pleiotropic and Epigenomic Prediction Accuracy (PEPA)--that evaluates how well the clusters predict SNP-to-trait and SNP-to-tissue associations using off-chromosome data, avoiding overfitting. Applying J-PEP to GWAS summary statistics for 165 diseases/traits (average N=290K), we attained 16-30% higher PEPA than pleiotropic or epigenomic partitioning approaches with larger improvements for well-powered traits, consistent with simulations; these gains arise from J-PEP's tendency to upweight correlated structure--signals present in both auxiliary trait and tissue data--thereby emphasizing shared components. For type 2 diabetes (T2D), J-PEP identified clusters refining canonical pathological processes while revealing underexplored immune and developmental signals. For hypertension (HTN), J-PEP identified stromal and adrenal-endocrine processes that were not identified in prior analyses. For neutrophil count, J-PEP identified hematopoietic, hepatic-inflammatory, and neuroimmune processes, expanding biological interpretation beyond classical immune regulation. Notably, integrating single-cell chromatin accessibility data refined bulk-based clusters, enhancing cell-type resolution and specificity. For T2D, single-cell data refined a bulk endocrine cluster to pancreatic islet β-cells, consistent with established β-cell dysfunction in insulin deficiency; for HTN, single-cell data refined a bulk endocrine cluster to adrenal cortex cells, consistent with a GO enrichment for neutrophil-mediated inflammation that implicates feedback between aldosterone production in the adrenal gland and local immune signaling. In conclusion, J-PEP provides a principled framework for partitioning GWAS loci into interpretable, tissue-informed clusters that provide biological insights on complex disease.

11

Open Targets Genetics: An open approach to systematically prioritize causal variants and genes at all published GWAS trait-associated loci

Mountjoy, E.; Schmidt, E. M.; Carmona, M.; Peat, G.; Miranda, A.; Fumis, L.; Hayhurst, J.; Buniello, A.; Schwartzentruber, J.; Karim, M. A.; Wright, D.; Hercules, A.; Papa, E.; Fauman, E.; Barrett, J. C.; Todd, J. A.; Ochoa, D.; Dunham, I.; Ghoussaini, M.

2020-09-17 genetics 10.1101/2020.09.16.299271 medRxiv

Top 0.1%

39.6%

Genome-wide association studies (GWAS) have identified many variants robustly associated with complex traits but identifying the gene(s) mediating such associations is a major challenge. Here we present an open resource that provides systematic fine-mapping and protein-coding gene prioritization across 133,441 published human GWAS loci. We integrate diverse data sources, including genetics (from GWAS Catalog and UK Biobank) as well as transcriptomic, proteomic and epigenomic data across many tissues and cell types. We also provide systematic disease-disease and disease-molecular trait colocalization results across 92 cell types and tissues and identify 729 loci fine-mapped to a single coding causal variant and colocalized with a single gene. We trained a machine learning model using the fine mapped genetics and functional genomics data using 445 gold standard curated GWAS loci to distinguish causal genes from background genes at the same loci, outperforming a naive distance based model. Genes prioritized by our model are enriched for known approved drug targets (OR = 8.1, 95% CI: [5.7, 11.5]). These results will be regularly updated and are publicly available through a web portal, Open Targets Genetics (OTG, http://genetics.opentargets.org), enabling users to easily prioritize genes at disease-associated loci and assess their potential as drug targets.

12

Pan-UK Biobank GWAS improves discovery, analysis of genetic architecture, and resolution into ancestry-enriched effects

Karczewski, K. J.; Gupta, R.; Kanai, M.; Lu, W.; Tsuo, K.; Wang, Y.; Walters, R. K.; Turley, P.; Callier, S.; Baya, N.; Palmer, D. S.; Goldstein, J. I.; Sarma, G.; Solomonson, M.; Cheng, N.; Bryant, S.; Churchhouse, C.; Cusick, C. M.; Poterba, T.; Compitello, J.; King, D.; Zhou, W.; Seed, C.; Finucane, H. K.; Daly, M. J.; Neale, B. M.; Atkinson, E. G.; Martin, A. R.

2024-03-15 genetic and genomic medicine 10.1101/2024.03.13.24303864 medRxiv

Top 0.1%

39.6%

Large biobanks, such as the UK Biobank (UKB), enable massive phenome by genome-wide association studies that elucidate genetic etiology of complex traits. However, individuals from diverse genetic ancestry groups are often excluded from association analyses due to concerns about population structure introducing false positive associations. Here, we generate mixed model associations and meta-analyses across genetic ancestry groups, inclusive of a larger fraction of the UKB than previous efforts, to produce freely-available summary statistics for 7,266 traits. We build a quality control and analysis framework informed by genetic architecture. Overall, we identify 14,676 significant loci (p < 5 × 10⁻⁸) in the meta-analysis that were not found in the EUR genetic ancestry group alone, including novel associations, for example between CAMK2D and triglycerides. We also highlight associations from ancestry-enriched variation, including a known pleiotropic missense variant in G6PD associated with several biomarker traits. We release these results publicly alongside FAQs that describe caveats for interpretation of results, enhancing available resources for interpretation of risk variants across diverse populations.

13

Large-scale trans-ethnic replication and discovery of genetic associations for rare diseases with self-reported medical data

Shringarpure, S. S.; Wang, W.; Jiang, Y.; Acevedo, A.; Dhamija, D.; Cameron, B.; Jubb, A.; Yue, P.; The 23andMe Research Team; Sarov-Blat, L.; Gentleman, R.; Auton, A.

2021-06-16 genetic and genomic medicine 10.1101/2021.06.09.21258643 medRxiv

Top 0.1%

39.5%

A key challenge in the study of rare disease genetics is assembling large case cohorts for well-powered studies. We demonstrate the use of self-reported diagnosis data to study rare diseases at scale. We performed genome-wide association studies (GWAS) for 33 rare diseases using self-reported diagnosis phenotypes and re-discovered 29 known associations to validate our approach. In addition, we performed the first GWAS for Duane retraction syndrome, vestibular schwannoma and spontaneous pneumothorax, and report novel genome-wide significant associations for these diseases. We replicated these novel associations in non-European populations within the 23andMe, Inc. cohort as well as in the UK Biobank cohort. We also show that mixed model analyses including all ethnicities and related samples increase the power for finding associations in rare diseases. Our results, based on analysis of 19,084 rare disease cases for 33 diseases from 7 populations, show that large-scale online collection of self-reported data is a viable method for discovery and replication of genetic associations for rare diseases. This approach, which is complementary to sequencing-based approaches, will enable the discovery of more novel genetic associations for increasingly rare diseases across multiple ancestries and shed more light on the genetic architecture of rare diseases.

14

Frequency enrichment of coding variants in a French-Canadian founder population and its implication for inflammatory bowel diseases

Bherer, C.; Grenier, J.-C.; Pelletier, J.; Boucher, G.; Gagnon, G.; Goyette, P.; Ashton-Beaucage, D.; Stevens, C.; Battat, R.; Bitton, A.; Campeau, P.; Laprise, C.; Huang, H.; Daly, M. J.; Taliun, D.; Hussin, J. G.; Mooser, V.; Rioux, J. D.

2025-07-14 genetic and genomic medicine 10.1101/2025.07.11.25331388 medRxiv

Top 0.1%

39.5%

The genetic features of founder populations with recent bottlenecks, causing some deleterious variants to rise to higher frequencies, can enhance the power of rare variant association studies. French Canadians from Quebec represent a recent founder population with a particular disease heritage comprising more than 30 prevalent Mendelian conditions. Here, we characterize coding variation in this founder population using exome sequencing data from 2,820 French-Canadian participants - patients with inflammatory bowel diseases (IBD), parents and controls from the Quebec IBD cohort. We find that 18% of rare coding variants are 10-100 times more frequent than in non-Finnish Europeans (NFE). A total of 4,133 missense and loss-of-function variants were significantly enriched with a median 28-fold enrichment, revealing the potential for genotype-phenotype associations in this population. We describe significantly enriched pathogenic variants, including those known to account for the increased prevalence of rare diseases in FC compared to other European descent populations, such as Agenesis of corpus callosum and peripheral neuropathy (SLC12A6) and Leigh Syndrome French Canadian type (LRPPRC). Finally, we investigate whether rare protein-coding variants, enriched in French Canadians by the founder effect, contribute to the risk of IBD using trio and case/control cohorts. In addition to replicating associations in NOD2 and IL23R, we identified new candidate association signals, including enriched variants in SLC35E3 and ARSA. Our findings show that, even in well-characterized founder populations like the French Canadians, there remains untapped potential for genetic discovery, revealing both rare and complex disease risk factors through enriched coding variation.

15

CYClones: A highly powered, fully genotyped, 8-parent yeast mapping population

Cromie, G.; Lo, R.; Morgan, T. S.; Clark, A.; Ashmead, J.; Timour, M. S.; Sirr, A.; Akey, J. M.; Dudley, A. M.

2025-10-16 genetics 10.1101/2025.10.15.682626 medRxiv

Top 0.1%

39.4%

The budding yeast Saccharomyces cerevisiae is a remarkably adaptable organism that thrives in diverse environments. Global sequencing of natural isolates has revealed extensive genetic diversity within the species. Here, we describe the construction and characterization of CYClones (Collaborative Yeast Cross clones), a library of 11,392 segregants generated from a multiparent funnel cross of eight genetically diverse parental strains. To enable the genetic dissection of complex traits, we imputed whole-genome sequences for all segregants and show that CYClones captures a substantial fraction of the global genetic diversity of S. cerevisiae. Haplotype representation is well maintained, with each parental haplotype present at >5% frequency across >95% of the genome. Simulations demonstrate that CYClones has ≥95% power to detect variants with heritability as low as 0.36%, with mapping resolution often finer than the length of a single gene. In summary, CYClones is a powerful community resource for dissecting the genetic architecture of complex and quantitative traits, uncovering context-dependent mutational effects, and identifying causal variants underlying phenotypic diversity.

16

Inferring causal cell types of human diseases and risk variants from candidate regulatory elements

Kim, A.; Zhang, Z.; Legros, C.; Lu, Z.; de Smith, A.; Moore, J.; Mancuso, N.; Gazal, S.

2024-05-18 genetic and genomic medicine 10.1101/2024.05.17.24307556 medRxiv

Top 0.1%

39.0%

The SNP-heritability of human diseases is extremely enriched in candidate regulatory elements (cREs) from disease-relevant cell types. Critical next steps are to understand whether these enrichments are driven by multiple causal cell types and whether individual variants impact disease risk via a single or multiple cell types. Here, we propose CT-FM and CT-FM-SNP, two methods accounting for cREs shared across cell types to identify independent sets of causal cell types for a trait and its candidate causal variants, respectively. We applied CT-FM to 63 GWAS summary statistics (average N = 417K) using 924 cRE annotations, primarily from ENCODE4. CT-FM inferred 79 sets of causal cell types, with corresponding SNP-annotations explaining 39.0 ± 1.8% of trait SNP-heritability. It identified 14 traits with independent causal cell types, uncovering previously unexplored cellular mechanisms in height, schizophrenia and autoimmune diseases. We applied CT-FM-SNP to 39 UK Biobank traits and predicted high-confidence causal cell types for 3,091 candidate causal non-coding SNP-trait pairs. Our results suggest that most SNPs affect a phenotype via a single set of cell types, whereas pleiotropic SNPs might target different cell types depending on the phenotype context. Altogether, CT-FM and CT-FM-SNP shed light on how genetic variants act collectively and individually at the cellular level to affect disease risk.

17

Clinical and Biological Stratification in 121,560 Antidepressant Prescription Trajectories using Unsupervised Modelling and Clustering

Herrero Zazo, M.; Fitzgerald, T. W.; Banasik, K.; Louloudis, I.; Vassos, E.; Colon-Ruiz, C.; Segura-Bedmar, I.; Kessing, L. V.; Ostrowski, S. R.; Pedersen, O. B.; Schork, A.; Sorensen, E.; Ullum, H.; Werge, T.; Bruun, M. T.; Christoffersen, L. A.; Didriksen, M.; Erikstrup, C.; Aagaard, B.; Mikkelsen, C.; DBDS Genomic Consortium; Lewis, C.; Brunak, S.; Birney, E.

2024-12-20 psychiatry and clinical psychology 10.1101/2024.12.17.24319152 medRxiv

Top 0.1%

38.0%

Major depressive disorder is a complex condition with diverse presentations and polygenic underpinnings. Leveraging large biobanks linked to primary care prescription data, we developed a data-driven approach based on antidepressant prescription trajectories for patient stratification and novel phenotype identification. We extracted quantitative prescription trajectories for 56,951 UK Biobank (UKB) and 64,609 Danish National Biobank (CHB+DBDS) individuals. Using Hidden Markov Models and K-means clustering, we identified five and six patient clusters, respectively. Multinomial logistic regression and non-parametric association tests, using clinical information, enabled patient group characterization. We consistently identified three common patient groups across cohorts: first, a majority group of individuals with mild to moderate depression; second, those with severe mental illness (i.e., a group with a higher likelihood of psychiatric diagnoses, such as bipolar depression, with odds ratios: OR_UKB = 1.87 [95% CI = 1.48, 2.35], p = 2.7e-6; OR_CHB+DBDS = 1.69 [95% CI = 1.41, 2.02], p = 2.3e-7); and third, patients with less severe forms of depression or receiving treatment for conditions other than depression (i.e., a group with a lower likelihood of depression diagnosis: OR_UKB = 0.80 [95% CI = 0.74, 0.85], p = 3e-10; OR_CHB+DBDS = 0.77 [95% CI = 0.73, 0.82], p < 1e-10). Genome-wide association studies (GWAS) revealed 14 significant loci, including USP4 and BCHE on chromosome 3, as well as a locus associated with the drug-metabolising enzyme CYP2D6. These findings, and the reproducibility across cohorts, demonstrate the power of unsupervised phenotyping from primary care prescriptions for patient stratification and pharmacogenetics research.

18

Resolving inflammatory bowel disease risk variants to genes and cell types

Fachal, L.; Zhang, R.; Gettler, K.; Haritunians, T.; Cleynen, I.; Stevens, C. R.; Zhang, Q.; Tastad, C.; Medici, C.; Do, R.; IIBDGC GWAS Group; Abreu, M. T.; Achkarj, J.-P.; Ahmad, T.; Bel Kok, K.; Bernstein, C.; Brooks, J.; Bujanda, L.; Butterworth, J.; Clark, K.; Cummings, F.; D'Amato, M.; Del Buono, J.; Duerr, R. H.; Ellinghaus, D.; Foley, S.; Franchimont, D.; Franke, A.; Hancock, L.; Hart, A.; Hooper, P.; Irving, P.; Jarvis, M.; Johnston, E.; Julia, A.; Kemp, C.; Kennedy, N.; Kupcinskas, J.; Latiano, A.; Lewis, J.; Li, A.; Limdi, J.; Louis, E.; McLaughlin, J.; Moayyedi, P.; Moran, G.; M

2026-05-18 genetic and genomic medicine 10.64898/2026.05.13.26352926 medRxiv

Top 0.1%

37.9%

Inflammatory bowel diseases (IBD), principally Crohn's disease (CD) and ulcerative colitis (UC), are common chronic disorders involving inflammation and often progressive tissue damage. Genome-wide association studies have mapped many risk signals, but the causal variants, effector genes and relevant cellular contexts remain difficult to resolve, limiting mechanistic interpretation and therapeutic translation. Here we performed a multi-ancestry GWAS meta-analysis of 125,992 individuals with IBD and more than 1.2 million controls, identifying 619 independent association signals (374 novel) at 420 IBD regions that account for 77-80% of SNP-based heritability. Fine-mapping resolved 81 high-confidence variants, 41 not previously reported. Although most signals were shared between CD and UC, 39% showed subtype specificity, with UC signals showing stronger enrichment in functional annotations from intestinal epithelial, secretory and enteroendocrine cells, and CD showing stronger genetic correlations with circulating inflammatory biomarkers, including C-reactive protein and glycoprotein acetylation. Latent causal modelling supported a causal effect of decreased high-density lipoprotein on CD risk. By integrating bulk and single-cell eQTL and pQTL resources using colocalisation and Mendelian randomisation, together with coding-variant evidence from exome sequencing, we prioritised 664 candidate effector genes across 341 signals, including 390 newly implicated IBD genes, revealing new biological mechanisms and candidate therapeutic targets supported by human genetics.

19

Quantitative trait loci mapping of gene expression and chromatin accessibility in primary fibroblast reveals shared allelic effects between Latin American and European ancestries

Boltz, T.; Bot, M.; Lapinska, S.; Schwarz, T.; Hou, K.; Garske, K. M.; Freund, M. K.; Bearden, C. E.; Macaya, G.; Lopez-Jaramillo, C.; Freimer, N. B.; Boks, M. P.; Kahn, R. S.; Pasaniuc, B.; Ophoff, R. A.

2025-06-10 genomics 10.1101/2025.06.09.658613 medRxiv

Top 0.1%

37.7%

Quantitative Trait Locus (QTL) analysis of molecular data has identified genetic variants associated with traits such as gene expression, and colocalization of these functional QTL with GWAS risk loci has offered insights into the genetic basis of human disease. We employed gene expression (RNA-seq) and chromatin accessibility (ATAC-seq) obtained from human primary fibroblasts to investigate quantitative trait loci (QTLs) in cohorts ascertained for bipolar disorder of European (n=150) and Latin American (n=96) ancestries. Leveraging data from three countries of origin (The Netherlands, Colombia, Costa Rica) within our cohort, we characterized differences among individuals at the SNP, gene, and accessible-chromatin levels to compute ancestry-specific expression (e)QTLs and chromatin-accessibility (ca)QTLs. Across ancestries, we observed R² ≥ 0.93 for eQTL effect sizes and R² ≥ 0.95 for caQTLs, indicating a high degree of concordance. Integrating chromatin data with expression and genotype information enabled precise fine-mapping of eQTLs, yielding 203 high-confidence (posterior probability > 90%) regulatory pathways. In downstream analyses, transcriptome-wide (TWAS) and chromatin-wide (CWAS) association studies with brain- and skin-related GWAS identified 36 TWAS-significant genes and 77 CWAS-significant open chromatin regions. These findings underscore the shared genetic regulatory mechanisms across European and Latin American ancestries, while demonstrating that ancestry-specific reference panels enhance the accuracy of TWAS and CWAS in diverse populations.

20

Comprehensive gene heritability estimation reveals the genetic architecture of rare coding variants underlying complex traits

Liu, Z.; Fu, B.; Jeong, M.; Anand, P.; Anand, A.; Jang, S.-K.; Gorla, A.; Zhu, J.; Pajukanta, P.; Palamara, P. F.; Zaitlen, N.; Border, R.; Sankararaman, S.

2025-10-08 genetics 10.1101/2025.10.07.681018 medRxiv

Top 0.1%

37.5%

Whole-exome sequencing (WES) enables high-resolution interrogation of the contribution of rare coding variants to complex trait variation. However, existing methods for heritability estimation attributed to rare coding variants are often limited by the effects of linkage disequilibrium (LD) and by the sparse nature of rare variant data. We introduce FLEX (Fast, LD-aware Estimation of eXome-wide and gene-level heritability), a scalable and flexible framework for estimating and partitioning heritability across genes or sets of genes using WES data. FLEX integrates all coding variants, from common to ultra-rare, within a unified model and corrects for LD-induced effects to improve the accuracy of heritability estimates. In addition, FLEX supports both individual-level and summary statistic data and is computationally efficient for biobank-scale datasets. Through extensive simulations, we show that FLEX is well-calibrated while providing accurate heritability estimates. We applied FLEX to WES data across N = 153,351 unrelated European ancestry individuals and 20 quantitative traits in the UK Biobank. We identified 64 gene-trait pairs with significant gene-level heritability (p < 0.05/18,624 accounting for the number of protein-coding genes tested), among which rare coding variants explained 38% of gene-level heritability, on average. Compared to heritability estimates from genome-wide imputed SNPs, incorporation of rare and ultra-rare coding variants led to a 24.8% increase in heritability on average, while effect sizes at rare and ultra-rare variants are substantially larger (≈18x on average). Partitioning across variant effect annotations, we find that predicted loss-of-function variants had stronger individual effects than missense variants (24% on average) while missense variants accounted for a greater share of rare coding heritability. Together, FLEX provides an adaptable and accurate approach for quantifying gene-level heritability, advancing our understanding of the genetic architecture of complex traits, and facilitating the discovery of trait-relevant genes.